Generalized Substring Compression

Authors

Orgad Keller

Tsvi Kopelowitz

Shir Landau Feibish

Moshe Lewenstein

Abstract

In substring compression, one is given a text to preprocess so that, upon request, a compressed substring is returned. Generalized substring compression is the same with the following twist: the queries contain an additional context substring (or a collection of context substrings), and the answers are the substring in compressed format, where the context substring is used to make the compression more efficient. We focus our attention on generalized substring compression and present the first non-trivial correct algorithm for this problem. In our algorithm we inherently propose a method for finding the bounded longest common prefix of substrings, which may be of independent interest. In addition, we propose an efficient algorithm for substring compression which makes use of range searching for minimum queries. We present several tradeoffs for both problems. For compressing the substring S[i..j] (possibly with the substring S[α..β] as a context), the best query times we achieve are O(C) and O(C log((j − i)/C)) for the substring compression query and the generalized substring compression query, respectively, where C is the number of phrases encoded.
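
To make the query concrete, the following is a naive sketch of what a generalized substring compression query returns. It is not the authors' algorithm: it uses no preprocessing and runs in quadratic time or worse. It assumes an LZ77-style greedy phrase parsing (the abstract only speaks of "phrases"), 0-based inclusive indices, and a separator character that does not occur in the text. The names `lz_phrases`, `decode`, `generalized_substring_compress`, and `SEP` are hypothetical and chosen for illustration only.

```python
SEP = "\0"  # assumption: this character never occurs in the text


def lz_phrases(target, context=""):
    """Greedy LZ77-style parse of `target`.

    A phrase may copy from `context` or from the already parsed part of
    `target` (self-overlapping copies are allowed). Copy sources are
    reported as indices into the string context + SEP + target.
    """
    window = context + SEP + target
    pos = len(context) + 1  # first position of target inside window
    phrases = []
    while pos < len(window):
        best_len, best_src = 0, -1
        for src in range(pos):  # try every earlier start position
            length = 0
            while (pos + length < len(window)
                   and window[src + length] == window[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len == 0:
            phrases.append(("literal", window[pos]))
            pos += 1
        else:
            phrases.append(("copy", best_src, best_len))
            pos += best_len
    return phrases


def decode(phrases, context=""):
    """Rebuild the target from its phrases and the same context."""
    buf = list(context + SEP)
    for phrase in phrases:
        if phrase[0] == "literal":
            buf.append(phrase[1])
        else:
            _, src, length = phrase
            for k in range(length):  # char by char, so overlaps work
                buf.append(buf[src + k])
    return "".join(buf[len(context) + 1:])


def generalized_substring_compress(S, i, j, alpha=None, beta=None):
    """Compress S[i..j], optionally using S[alpha..beta] as context."""
    target = S[i:j + 1]
    use_context = alpha is not None and beta is not None
    context = S[alpha:beta + 1] if use_context else ""
    return lz_phrases(target, context)


S = "abcdefabcdef"
plain = generalized_substring_compress(S, 6, 11)
with_context = generalized_substring_compress(S, 6, 11, 0, 5)
print(len(plain), len(with_context))
```

In this example the substring S[6..11] has no internal repetition, so without a context every character becomes its own phrase, while the context S[0..5] lets the whole substring be encoded as a single copy. The phrase count plays the role of C in the query-time bounds quoted above.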

Related resources

Efficient VLSI Architecture for Lossless Data Compression

An architecture for LZ1-type lossless data compression is described. The architecture is area-efficient and fast since it exploits the locality of substring match lengths. The property has been shown experimentally for various data and buffer lengths, and an architecture based on it has been designed.

Finding Synchronization Codes to Boost Compression by Substring Enumeration

Synchronization codes are frequently used in numerical data transmission and storage. Compression by Substring Enumeration (CSE) is a new lossless compression scheme that has turned into a new and unusual application for synchronization codes. CSE is an inherently bit-oriented technique. However, since the usual benchmark files are all byte-oriented, CSE incurred a penalty due to a problem calle...

Finding Characteristic Substrings from Compressed Texts

Text mining from large-scale data is of great importance in computer science. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeating substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel p...

String Noninclusion Optimization Problems

For every string inclusion relation there are two optimization problems: find a longest string included in every string of a given finite language, and find a shortest string including every string of a given finite language. As an example, the two well-known pairs of problems, the longest common substring (or subsequence) problem and the shortest common superstring (or supersequence) problem, ...

Implementation of Delta Compression

A Matlab simulation is carried out to verify the compression ratio analysis. Packet Xk and V are generated as two i.i.d. random sequences that follow a discrete uniform distribution between 0 and 255 with a packet length of 1,500 bytes. Packet Xk+1 is generated according to the simplified content generation model. Considering Xk and Xk+1 as two byte strings, our lossless delta compression algor...

Publication date: 2009
